Single-support serial isomorphous replacement phasing

A simple method for serial single isomorphous replacement, which exploits natural differences in heavy-atom occupancy, is presented.


Introduction
Atomic resolution structural information is critical to our understanding of fundamental biological processes and plays an increasingly important role in the development and improvement of pharmaceuticals and chemical biology probes. Macromolecular crystallography (MX) is one of the most effective ways to obtain such information. However, MX can be limited by the phase problem (Taylor, 2003) and the necessity of growing large single crystals for data collection. Traditionally, the phasing of crystallographic data has required heavy-atom soaking or derivatization and crystal sizes of >100 µm. Working with smaller samples of 1-20 µm has many advantages, including a reduction in the time and material that are needed for crystal optimization, especially for challenging projects such as those with membrane proteins. It also offers a more uniform soaking of heavy atoms or ligands and more complete illumination in optical pump-probe experiments. The proliferation of microfocus synchrotron beamlines (Nanao et al., 2022;Hasegawa et al., 2013;Evans et al., 2007) and advanced data-collection/analysis methods has facilitated measurements from these smaller crystals; however, radiation damage makes the collection of complete, high-quality data sets from single microcrystals extremely challenging (Holton & Frankel, 2010). The answer to this problem appears to be serial/multi-crystal approaches such as synchrotron serial crystallography (SSX), in which data from many crystals are merged to produce a single data set (Stellato et al., 2014;Botha et al., 2015;Zander et al., 2015;Hasegawa et al., 2017). Indeed, combining serial methods with intense microbeams has allowed the boundaries of crystal size to be pushed in recent years. Multi-crystal methods do come at a significant price, however: the natural variation between crystals ('non-isomorphism') can degrade the quality of the final merged data sets (Giordano et al., 2012), which is a particular challenge for phasing applications.
One of the earliest methods of experimental macromolecular crystallography phasing is the single isomorphous replacement (SIR) method (Crick & Magdoff, 1956;Green et al., 1954), in which data are collected from both a heavy-atom-soaked crystal and an unsoaked 'native' crystal. Differences between the intensities are used to determine the positions of the heavy atoms, which can then be used to experimentally determine phases for the native protein data. SIR offers the advantages of potentially very large differences in intensity, which can in turn provide very large phasing powers. However, its use in multi-crystal methods is complicated by both natural and heavy-atom-induced non-isomorphism. Indeed, the differences in intensities due to non-isomorphism are often larger than the signal induced by heavy-atom binding. As a result, SIR has to date been relatively uncommon in multi-crystal experiments, and the existing work has primarily been on still image data from free-electron lasers (Botha et al., 2015;Yamashita et al., 2015;Nakane et al., 2016;Zhang et al., 2015). In addition to the problem of non-isomorphism, SIR has the practical limitation that successful SIR experiments typically require the preparation and collection of diffraction data from many samples in order to identify groups of crystals for which the heavy-atom occupancies and isomorphism are high enough while also maintaining sufficient diffraction quality. This process often consumes a significant amount of manpower and beamtime.
Spatiotemporal gradients of ligand concentrations have been simulated and shown experimentally (Cole et al., 2014;Geremia et al., 2006;Pandey et al., 2021;Mizutani et al., 2014;Schmidt, 2013). We reasoned that if a population of different heavy-atom occupancies could be established, we could use a genetic algorithm (GA)-based grouping technique (Zander et al., 2016;Foos et al., 2019;Cianci et al., 2019) to distinguish derivative from native data sets. Indeed, here we report a method in which single heavy-atom soaks are performed followed by SSX data collection. A genetic algorithm is then used to group data sets that can be used to successfully determine phases experimentally by SIR.

Sample preparation
Four different kinds of protein microcrystals derivatized with different heavy atoms were analyzed. Lysozyme crystals of between 5 and 20 µm in size were grown in batch: a 40 mg ml⁻¹ lysozyme solution was prepared in a solution consisting of 1.5 M NaCl, 0.1 M sodium acetate pH 4.6, 30% PEG 5000. Crystals of proteinase K, insulin and thermolysin were obtained using the hanging-drop vapor-diffusion method. Proteinase K crystals were prepared at 50 mg ml⁻¹ in 50 mM HEPES pH 7.0 with a well solution consisting of 0.5-1.5 M sodium nitrate, 100 mM citrate pH 6.5. Insulin was dissolved to 15 mg ml⁻¹ in 50 mM Na₂HPO₄ pH 10.4 with 1 mM EDTA pH 8.0 and crystallized in 350-450 mM Na₂HPO₄ pH 10.4, 10 mM EDTA. Thermolysin was prepared at 50 mg ml⁻¹ in 50 mM MES pH 6.0 with 45% DMSO and the well solution consisted of 35%(w/v) ammonium sulfate dissolved in water; the crystallization drops were prepared by mixing the protein solution with the well solution in a 1:1 ratio (Marshall et al., 2012). All crystals were obtained at 20 °C. Large (100-500 µm) crystals were crushed between siliconized coverslips to obtain a range of microcrystal sizes between 5 and 20 µm. Stock solutions of Gd-HPDO3A (gadoteridol; Girard et al., 2002), mercury(II) acetate, samarium(III) nitrate and sodium iodide were made in water at 25 mM, 20 mM, 5 mM and 1 M, respectively. These stocks were added to glycerol (final concentration of 25%) and well solution to obtain soaking buffers with final heavy-atom concentrations of 2 mM, 5 mM, 667 µM and 400 mM, respectively. Microcrystalline slurries were transferred to 2 µl of these soaking solutions using 700 µm diameter micro-meshes with 10 µm openings (MiTeGen). The transfer of crystals is likely to be preferable to direct addition of heavy atoms to crystallization drops because of the competition of uncrystallized protein for heavy-atom binding.
The heavy-atom soak times were 5 min, 4 min, 1 min and 30 s, respectively, based on previous experience with non-serial SIR experiments on larger crystals. In practice, soaking times can be established by setting up enough slurry for multiple meshes, removing slurry at several time points, harvesting on micro-meshes and flash-cooling in liquid nitrogen.

Data collection and merging
Data were collected on the fixed-energy ESRF beamline ID23-EH2 (Nanao et al., 2022) at 14.2 keV with a PILATUS3 2M detector and MD3Up diffractometer (Maatel). Data collection was performed at 100 K in MxCuBE (Oscarsson et al., 2019) using the MeshAndCollect workflow (Zander et al., 2015) (Table 1). Diffraction images and metadata (XDS input files) have been uploaded to Zenodo under ID 5111402 (https://doi.org/10.5281/zenodo.5111402). Data were initially processed automatically using XDS and Grenades (Monaco et al., 2013). The partial data set with the highest overall $\langle I/\sigma(I) \rangle$ was used as a reference data set for re-integration in XDS (Kabsch, 2010b) in order to account for indexing ambiguity. It is interesting to note that even in well behaved test cases such as these, the range of unit-cell parameters across the entire pool of data sets is generally around 1-2%, which suggests a non-negligible amount of non-isomorphism. Indeed, in their pioneering analysis of non-isomorphism, Crick & Magdoff (1956) estimated that unit-cell changes of only 0.5% lead to 15% changes in intensities of acentric reflections at 3 Å. The merging R values are generally quite high when all data are merged (Table 2).
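The unit-cell spread quoted above can be checked for any pool of partial data sets with a few lines of Python. The following is a minimal sketch, not part of the published workflow: it assumes that the 'range' is expressed as (max − min)/mean for each cell axis, and the function name and the example cell lengths are hypothetical.

```python
def cell_range_percent(cells):
    """Return the (max - min) / mean spread, in percent, for each cell axis.

    `cells` is a sequence of (a, b, c) cell lengths, one tuple per partial
    data set. Expressing the range relative to the mean is an assumption.
    """
    spreads = []
    for axis_values in zip(*cells):  # regroup the tuples axis by axis
        mean = sum(axis_values) / len(axis_values)
        spreads.append(100.0 * (max(axis_values) - min(axis_values)) / mean)
    return spreads


# Hypothetical cell lengths (a, b, c) in Å for three partial data sets.
example_cells = [(78.0, 78.0, 37.0), (78.8, 78.8, 37.2), (79.2, 79.2, 37.4)]
example_spreads = cell_range_percent(example_cells)
print([round(value, 2) for value in example_spreads])
```

Running the same function on the full pool and on each merged group gives a quick indication of whether grouping has narrowed the unit-cell distribution.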
Partial data sets were then submitted to the CODGAS (Zander et al., 2016) genetic algorithm for separation into four groups followed by scaling and merging in XSCALE (Kabsch, 2010a) (Fig. 1). The number of groups was set higher than usual because of the anticipated increase in heavy-atom-induced non-isomorphism and the potential presence of both native and derivative data. The numbers of partial data sets in the native and derivative data sets are indicated in Table 2. While it would be helpful to establish a generally useful guideline for the minimum total number of partial data sets to collect in the MeshAndCollect workflow, this parameter is likely to vary as a function of the heavy-atom occupancy, diffraction resolution and symmetry. Indeed, Table 2 shows a dramatic range in the number of data sets comprising the final native and derivative data sets. It is likely that the total number of data sets that we collected was in great excess of what was necessary. When partial data sets were removed from the pool of lysozyme data, we found that as few as 20 partial data sets out of 67 could be used to determine the phases. Insulin and thermolysin phasing was successful with 75 out of 149 and 40 out of 53 data sets, respectively. However, the number of proteinase K data sets could only be reduced to 85 from the total of 91 collected. It should be noted, however, that the speed of the workflow makes the collection of 100 partial data sets quite rapid and there is therefore very little disadvantage in collecting a larger pool. Improvements to the GA could in principle further reduce the requirement for the total number of data sets.
Default parameters were used in the CODGAS target function. CODGAS jobs were run on the ESRF SLURM cluster. Run times vary as a function of data-set parameters, cluster load and the specific machine that was allocated, but as an example execution took 133 min for the lysozyme data set with 67 total partial data sets on ten 2.[…] Intel Xeon E5-2680 cores. The native and derivative data sets had significantly reduced ranges of unit-cell parameters compared with the ranges of the entire pool, indicating the successful identification of isomorphic groups (Table 2).
Figure 1. Program workflow for phasing. Data sets are collected from multiple crystals on a single support and indexed and integrated in XDS. These partial data sets are then submitted to CODGAS for grouping, and each group is submitted pairwise in both 'directions' to SHELXC/D/E for phasing.

Structure solution
The resultant data sets from CODGAS were then submitted pairwise to SHELXC/D/E (Sheldrick, 2010) for substructure and phase determination by SIR, without including anomalous scattering (Fig. 1). Because only isomorphous differences were considered in this work, there is no way to determine a priori whether one group is native or derivative. Therefore, SIR is performed in both 'directions' for each pair (Fig. 1). Phasing success was determined by visual inspection of electron-density maps in Coot (Emsley et al., 2010) and the correlation coefficient of the automatically built partial model ('partial CC') in SHELXE. Generally, a partial CC of greater than 25% was seen as evidence of a successful structure solution, but for thermolysin some solutions with lower values (down to 18%) still yielded easily interpretable electron-density maps. For post-phasing analysis, $F_o - F_c$ difference maps were calculated for the proteinase K data set for each CODGAS subgroup using phases from a proteinase K model without heavy atoms. Interestingly, these maps revealed that the 'native' data set (group 3) was also partially derivatized (Supplementary Fig. S1), but there was apparently a large enough difference in the heavy-atom occupancies between this group and group 2 to determine the phases experimentally. The peak heights for the native and derivative were 80 and 48 standard deviations above the mean value. Analysis of $F_o - F_c$ maps in the other systems also revealed heavy atoms in the 'native' data sets. Native versus derivative peak heights for thermolysin, insulin and lysozyme were 43 versus 51, 31 versus 37 and 29 versus 36 standard deviations above the mean, respectively. Merging statistics for the successful native and derivative data sets are shown in Table 2.
Figure 2. Segregation of native and isomorphous data sets can be used for SIR phasing in lysozyme Gd (upper panel), proteinase K Hg (middle panel) and thermolysin Sm (lower panel). Algorithm progress is shown on the x axis and the partial CC is shown on the y axis. Representative electron density from SHELXE is shown on the right at 1.5σ. The figure was produced using ggplot2 (https://ggplot2.tidyverse.org/), R (https://www.r-project.org/) and PyMOL (Schrödinger).
Table 2. Statistics for all data, native and derivative data sets, and partial data sets. Values in parentheses are for the outer shell. Note that some partial data sets were not assigned to either native or derivative groups.
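The pairwise, two-direction submission described in the structure-solution text can be enumerated as ordered pairs of groups. The sketch below is illustrative only: the function names, the group labels and the partial CC values are hypothetical, and the 25% cutoff is the general guideline quoted in the text rather than a strict criterion (some thermolysin solutions with lower values were still interpretable).

```python
from itertools import permutations


def sir_pairings(groups):
    """List every ordered (native, derivative) assignment of two distinct groups."""
    return list(permutations(groups, 2))


def best_solution(partial_cc, threshold=25.0):
    """Return the pairing with the highest partial CC (%), or None if none exceeds the threshold."""
    if not partial_cc:
        return None
    pairing = max(partial_cc, key=partial_cc.get)
    return pairing if partial_cc[pairing] > threshold else None


# Four groups give 4 x 3 = 12 SIR attempts, since each pair is tried in both directions.
pairings = sir_pairings(["group1", "group2", "group3", "group4"])
print(len(pairings))

# Hypothetical partial CC values (%) for two of the attempts.
example_cc = {("group3", "group2"): 31.0, ("group2", "group3"): 8.0}
print(best_solution(example_cc))
```

Because the native/derivative assignment is not known in advance, both orderings of each pair appear in the list and the partial CC decides which direction, if any, is retained for map inspection.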
Examination of intermediate generations of the GA trajectory reveals a progressive enrichment of successful phasing results as a function of algorithm progress (Fig. 2). In contrast, iodine-soaked cubic insulin was not readily solved in the same manner (Fig. 3a, upper panel). Because the segregation of groups is dependent on both merging statistics and algorithmic parameters, we submitted multiple CODGAS runs varying both. However, changing the relative weights of the GA target function terms and the number of GA generations or the population size did not yield any improvements. While there are practical limitations to CODGAS parameter space, exploring it in even a fractional factorial approach can be quite time- and compute-intensive. This, coupled with the fact that not even modest improvements were observed, prompted us to adopt a different approach. We reasoned that a modification of the target function to include some metric of isomorphism might aid in group identification. We therefore introduced an additional term to the GA target function. In a classical SIR experiment, it is common to examine the merging R value for both the native and derivative data sets and compare it against an R value between the two data sets (and confirm that the absolute value of the R value between the data sets is not excessively high). This analysis gives the user an idea of the amount of signal and noise present in the experiment.
Figure 3. (a) Improvement of phasing success with the introduction of an isomorphous term in the genetic algorithm fitness function for insulin I. The frequency of the CC of the partial model is shown for $w_{\mathrm{iso}}$ of 0, 1, 10, 100, 1000 and 10 000. Average chain lengths of <11 residues per chain are shown in red and those of ≥11 are shown in cyan. (b) Experimental electron density from SHELXE contoured at 1.5σ.
Acta Cryst. (2022). D78, 716-724. Foos, Rizk and Nanao. Single-support SIR phasing.
We encoded a simple version of this heuristic analysis in a new term, ISO, based on the ratio of the intra:inter data-set R values, where $R_{\mathrm{int}} = \sum |F_o^2 - \langle F_o^2 \rangle| / \sum F_o^2$ as calculated by SHELXC, $R_{\mathrm{individual\_average}}$ is the average inner shell $R_{\mathrm{meas}}$, as calculated by XDS, and $w_{\mathrm{iso}}$ is the weight associated with this term. This term was added to the previously described fitness term to produce R + I + CC + C + M + ISO, where R = $(100 - R_{\mathrm{meas}}\ \text{overall})\,w_{R}$, I = $\langle I/\sigma(I) \rangle\ \text{overall}\; w_{I/\sigma(I)}$, CC = $\mathrm{CC}_{1/2}\ \text{overall}\; w_{\mathrm{CC}_{1/2}}$, C = $\text{completeness overall}\; w_{\mathrm{completeness}}$, M = $\text{multiplicity overall}\; w_{\mathrm{multiplicity}}$ and ISO is as defined above. We then performed the GA optimization with $w_{\mathrm{iso}}$ = 1, 10, 100, 1000 and 10 000. Because GAs rely on a pseudo-random initialization of the population, CODGAS was modified to run with an explicitly set random-number seed in order to eliminate any effects due to different starting conditions. This seed is then used by the underlying GA code library (DEAP; https://deap.readthedocs.io/en/master/). Run in this manner, varying the $w_{\mathrm{iso}}$ term dramatically increased the number of successful structure solutions (Fig. 3). Values of $w_{\mathrm{iso}}$ of greater than 10 produced the same results, suggesting that the weighting between this term and the other GA terms is not especially critical.
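To make the structure of the extended target function concrete, the sketch below evaluates the weighted sum R + I + CC + C + M + ISO for one candidate group and computes an $R_{\mathrm{int}}$-style statistic from repeated measurements. It is an illustration under stated assumptions and not the CODGAS implementation: the class and function names are hypothetical, the isomorphism ratio is passed in as a precomputed number because its exact form is not reproduced here, and multiplying that ratio by $w_{\mathrm{iso}}$ (as for the other terms) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MergingStats:
    """Overall merging statistics for one candidate group (hypothetical container)."""
    r_meas: float        # overall R_meas, in percent
    i_over_sigma: float  # overall <I/sigma(I)>
    cc_half: float       # overall CC1/2
    completeness: float  # overall completeness
    multiplicity: float  # overall multiplicity


def r_int(equivalents):
    """R_int = sum|F^2 - <F^2>| / sum F^2, summed over sets of equivalent measurements.

    `equivalents` is a list of lists; each inner list holds the repeated
    intensity measurements of one unique reflection.
    """
    numerator = 0.0
    denominator = 0.0
    for observations in equivalents:
        mean = sum(observations) / len(observations)
        numerator += sum(abs(value - mean) for value in observations)
        denominator += sum(observations)
    return numerator / denominator


def fitness(stats, weights, iso_ratio, w_iso):
    """Weighted sum R + I + CC + C + M + ISO (ISO taken as w_iso * iso_ratio, an assumption)."""
    return (
        (100.0 - stats.r_meas) * weights["R"]
        + stats.i_over_sigma * weights["I"]
        + stats.cc_half * weights["CC"]
        + stats.completeness * weights["C"]
        + stats.multiplicity * weights["M"]
        + iso_ratio * w_iso
    )


# Hypothetical statistics and unit weights, for illustration only.
example_stats = MergingStats(r_meas=10.0, i_over_sigma=5.0, cc_half=99.0,
                             completeness=98.0, multiplicity=4.0)
unit_weights = {"R": 1.0, "I": 1.0, "CC": 1.0, "C": 1.0, "M": 1.0}
print(fitness(example_stats, unit_weights, iso_ratio=2.0, w_iso=10.0))
```

With $w_{\mathrm{iso}}$ = 0 the function reduces to the original five-term sum, which corresponds to the $w_{\mathrm{iso}}$ = 0 case shown in Fig. 3(a).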

Summary and outlook
Here, we apply recent analysis methods to single isomorphous replacement, resulting in a method with unique advantages. This method can be performed using data from a single heavy-atom soak and sample holder, dramatically simplifying the SIR experiment. Sample preparation is followed by data collection using existing automated workflows such as MeshAndCollect (Zander et al., 2015). Such a data-collection strategy requires some method to separate native from derivative data sets. To this end, we have used the CODGAS GA, and indeed have demonstrated that such an approach can be used to identify two groups of internally isomorphous data sets and that the intensity differences between these data sets can be successfully used for de novo phase determination by SIR. It should be noted that we have used well behaved test systems, and it remains to be seen what the limits of this method are, particularly with respect to minimum resolution and lower symmetry.
Several improvements are already envisaged. The current target function applies only to merging statistics, but it is possible that metrics from downstream phasing steps could also be used. For example, an initial attempt at using SHELXD substructure solutions has been investigated. However, a metric of substructure correctness that is suitable for the target function has not yet been identified. The typically used CC(all) and CC(weak) metrics, for example, do not appear to offer sufficient discrimination between spurious and real solutions. Furthermore, there is a significant computational cost associated with this method. In this work, we have focused purely on isomorphous phasing, but by combining serial anomalous scattering (Melnikov et al., 2017) with SIR (SIRAS) the success rate could also be improved, and this is currently being studied. The anomalous signal, where present, could also be used to establish which data set is native and which is derivative. However, strong anomalous signal is not always available, depending on the element and beamline properties.
In this work, we have largely ignored radiation-damage effects by using relatively low doses. In some cases, specific radiation damage can be used for phasing (Banumathi et al., 2004;Nanao et al., 2005;Schiltz et al., 2004;Ravelli et al., 2003;Nanao & Ravelli, 2006;de Sanctis & Nanao, 2012). This technique can be loosely viewed as an 'inverted' SIR experiment. We have previously shown that radiation-damage-induced phasing is possible in serial experiments (Foos et al., 2018). This work employed a modified MeshAndCollect workflow which repeatedly collected data from the same crystals in order to obtain high- and low-dose data sets. However, it is also possible that differential radiation damage between crystals could be used in an analogous way to the gradient of heavy-atom occupancies used here. This would remove the requirement for multiple collections from the same crystals.
The suitability of cluster analysis (CA) based on correlations on intensities and/or unit-cell parameters (Giordano et al., 2012;Santoni et al., 2017;Foadi et al., 2013;Liu et al., 2011) or more sophisticated approaches using XSCALE_ISOCLUSTER and XDSCC12 (Assmann et al., 2020) has not yet been studied for SIR. However, it is possible that the GA and CA approaches could be complementary or indeed combined. For example, pre-grouping data with CA followed by fine-tuning in the GA could improve the separation and quality of the 'native' and 'derivative' data sets. Because the 'native' data sets contain some heavy atoms, there is clearly room for improvement in this regard.
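As an illustration of what an intensity-correlation pre-grouping step might look like, the sketch below joins partial data sets whose pairwise Pearson correlation on common reflections exceeds a cutoff. This is a hypothetical example rather than a tested part of the method: the function names, the single-linkage rule, the 0.9 cutoff and the toy intensities are all assumptions, and it does not reproduce any of the cited clustering programs.

```python
from math import sqrt


def pearson_cc(first, second):
    """Pearson correlation of intensities over reflections common to both data sets."""
    common = sorted(set(first) & set(second))
    if len(common) < 2:
        return 0.0
    xs = [first[hkl] for hkl in common]
    ys = [second[hkl] for hkl in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def pregroup(datasets, min_cc=0.9):
    """Single-linkage grouping: two data sets share a group if linked by CC >= min_cc."""
    names = list(datasets)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for index, first in enumerate(names):
        for second in names[index + 1:]:
            if pearson_cc(datasets[first], datasets[second]) >= min_cc:
                parent[find(first)] = find(second)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Toy intensities keyed by (h, k, l); 'b' is a scaled copy of 'a', 'c' is anti-correlated.
toy = {
    "a": {(1, 0, 0): 10.0, (0, 1, 0): 20.0, (0, 0, 1): 30.0},
    "b": {(1, 0, 0): 20.0, (0, 1, 0): 40.0, (0, 0, 1): 60.0},
    "c": {(1, 0, 0): 30.0, (0, 1, 0): 20.0, (0, 0, 1): 10.0},
}
print(pregroup(toy))
```

Groups obtained in this way could in principle serve as the starting population for the GA rather than as final 'native' and 'derivative' sets.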
Finally, while all systems were readily solved, the distribution of heavy-atom occupancies, which is related to the binding kinetics and crystal size, is likely to be a critical factor in the success of this technique. We have employed relatively gentle (short incubation time, low concentrations) heavy-atom soaking protocols in this study. However, the distribution of heavy-atom occupancies could perhaps be improved by varying the crystal sizes, beam sizes, heavy-atom concentrations and soak times. Nevertheless, we have demonstrated an extremely accessible experimental phasing protocol with associated computational analysis tools to reinvigorate the routine use of SIR in MX experiments.

Related literature
The following reference is cited in the supporting information for this article: Adams et al. (2010).